The intention of this project is to practice generalized linear regression skills on real data using the Bike Sharing Demand competition on Kaggle.
library(data.table)
library(pander)
library(ggplot2)
library(GGally)
library(gridExtra)
library(lubridate)
library(splines)
library(doMC)
library(MASS)
library(foreach)
library(parallel)
library(caret)
library(dplyr)The data was downloaded from Kaggle at https://www.kaggle.com/c/bike-sharing-demand/data on Jun 1, 2017. For the sake of this project, a 75-25 split will be used to validate the model.
bike <- fread("train.csv", na.strings = "", showProgress = FALSE) %>% tbl_df
bike_untrain <- bike
set.seed(123)
train <- sample_n(bike, size = nrow(bike) * 0.75)
naive <- train
test <- anti_join(bike, train)This data set was provided by Hadi Fanaee Tork using data from Capital Bikeshare.
The original assignment from Kaggle: “You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour by the test set, using only information available prior to the rental period.”
The bikeshare data set includes 10886 hourly periods that occurred between January 2011 and January 2013 in the Capital Bikeshare program in Washington, DC. Each hourly period contains a summary of environmental information during that hour including the weather and temperature, as well as information about the type of users making the rentals, and the features of the time of year when the hourly period occurred.
For this project, the desired outcome of the model will be to predict the total count of bikes rented in the hour.
Performance benchmark: The final model should have an RMSE less than 147.8 (number is derived from the naive linear model).
Relevance: Predicting total bike rentals is useful for effectively managing the Capital Bikeshare system. Managers should ensure the maximum number of bikes are available to the system if demand is expected to be high. The more successful bike rentals that occur, the more efficient and profitable the system will be. Features of the date and time of the rental are known well beforehand, and weather information can be known over a week in advance to set management expectations for the system.
The registered and casual features are removed since these are subcategories of the response variable, count, and, thus, would not be knowable before the hourly period.
train <- train %>% select(-registered, -casual)datetime to a datetime classConvert the datetime variable from a character vector to a datetime variable.
train$datetime <- ymd_hms(train$datetime)Convert multiple-category features to dummy variables for season and weather. The table function is used to make sure each category will have non-zero variability when split. All the categories except for the fourth category have sufficient variability. Using information from the codebook, this fourth category represents days with heavy precipitation. It will be combined with the third category to create a general precipitation category.
table(train$season)##
## 1 2 3 4
## 2003 2053 2048 2060
table(train$weather)##
## 1 2 3 4
## 5398 2128 637 1
train <- train %>% mutate(spring = ifelse(season == 1, 1, 0), summer = ifelse(season == 2, 1, 0), fall = ifelse(season == 3, 1, 0), winter = ifelse(season == 4, 1, 0)) %>% select(-season)
train <- train %>% mutate(clear_clouds = ifelse(weather == 1, 1, 0), mist = ifelse(weather == 2, 1, 0), lt_precip = ifelse(weather == 3 | weather == 4, 1, 0)) %>% select(-weather)This step scales the continuous variables and saves the center and scale information for scaling the test data later.
temp_train <- scale(train$temp)
atemp_train <- scale(train$atemp)
humidity_train <- scale(train$humidity)
windspeed_train <- scale(train$windspeed)
train <- train %>% mutate(temp = scale(temp)[,1], atemp = scale(atemp)[,1], humidity = scale(humidity)[,1], windspeed = scale(windspeed)[,1])Scaling will ensure that none of the fitting processes will overweight one variable due to differences in scale and that no issues with being caused if modeling algorithms involving Euclidean distances are used.
In this step, new features are added to the data using features already present in the data.
To separate the information present in the datetime feature, the feature is split into its separate parts. In addition, a factor dummy variable to indicate if the day is a weekend is also added.
train <- train %>% mutate(hour = hour(datetime), day = day(datetime), month = month(datetime), year = year(datetime), weekendday = ifelse(wday(datetime) %in% c(1, 7), 1, 0), day_week = wday(datetime)) %>% select(-datetime)Ordinary Least Squares (OLS) Regression assumes a normally distributed response variable. However, this analysis involves a count response variable. Count variables are usually not normally distributed and require an assumption of a Poisson distribution to model correctly.
ggplot(train, aes(x = count)) +
geom_density() +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Density of `count`", y = "P(`count`)", x = "`count`")As the density plot shows, the count variable is not normally distributed, thus OLS will not account for the appropriate mean-variance relationships in the data.
Comparing different models will all the features included allows one to see the different outcomes in using a linear regression versus a Poisson regression. First, a linear model is fit:
fit_lm <- lm(count ~ temp + atemp + humidity + windspeed + hour + day + month + year + holiday + workingday + spring + summer + winter + clear_clouds + mist + lt_precip + day_week, train)Then, a generalized linear model using the Poisson distribution is fit:
fit_glm <- glm(count ~ temp + atemp + humidity + windspeed + hour + day + month + year + holiday + workingday + spring + summer + winter + clear_clouds + mist + lt_precip + day_week, data = train, family = "poisson")
summary(fit_glm)##
## Call:
## glm(formula = count ~ temp + atemp + humidity + windspeed + hour +
## day + month + year + holiday + workingday + spring + summer +
## winter + clear_clouds + mist + lt_precip + day_week, family = "poisson",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -25.257 -8.401 -2.795 3.876 36.567
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.932e+02 3.327e+00 -268.492 < 2e-16 ***
## temp 1.450e-01 4.555e-03 31.837 < 2e-16 ***
## atemp 2.020e-01 4.388e-03 46.036 < 2e-16 ***
## humidity -1.905e-01 1.024e-03 -186.049 < 2e-16 ***
## windspeed 4.264e-02 8.552e-04 49.857 < 2e-16 ***
## hour 4.560e-02 1.335e-04 341.627 < 2e-16 ***
## day 4.498e-03 1.462e-04 30.768 < 2e-16 ***
## month 7.281e-02 9.946e-04 73.204 < 2e-16 ***
## year 4.458e-01 1.654e-03 269.579 < 2e-16 ***
## holiday -1.237e-02 5.155e-03 -2.400 0.016402 *
## workingday 6.381e-03 1.774e-03 3.597 0.000322 ***
## spring 3.070e-01 6.860e-03 44.748 < 2e-16 ***
## summer 2.889e-01 3.637e-03 79.425 < 2e-16 ***
## winter 1.770e-01 4.246e-03 41.689 < 2e-16 ***
## clear_clouds 2.277e-01 3.990e-03 57.078 < 2e-16 ***
## mist 2.754e-01 4.052e-03 67.961 < 2e-16 ***
## lt_precip NA NA NA NA
## day_week 9.949e-03 4.038e-04 24.637 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1344509 on 8163 degrees of freedom
## Residual deviance: 752090 on 8147 degrees of freedom
## AIC: 804499
##
## Number of Fisher Scoring iterations: 5
One of the important assumptions of Poisson regression is that the dispersion parameter is about one. If this assumption is violated, the Poisson regression will not adequately provide for overdispersed count data. From the summary of fit_glm above, the dispersion parameter, calculated as the Residual Deviance divided by the degrees of freedom (df), is 752090/8147, or 92.3, which is much larger than one. Thus, a Poisson regression does not account for overdispersion in the data and will not model the mean-variance relationships in the data appropriately.
A negative binomial regression includes an extra parameter, theta, to allow the model to account for overdispersion. The model is fit using the glm.nb function from the MASS package.
fit_nb <- glm.nb(count ~ temp + atemp + humidity + windspeed + hour + day + month + year + holiday + workingday + spring + summer + winter + clear_clouds + mist + lt_precip + day_week, data = train)
summary(fit_nb)##
## Call:
## glm.nb(formula = count ~ temp + atemp + humidity + windspeed +
## hour + day + month + year + holiday + workingday + spring +
## summer + winter + clear_clouds + mist + lt_precip + day_week,
## data = train, init.theta = 1.282417604, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9683 -0.9741 -0.2606 0.3221 3.6884
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.611e+02 3.996e+01 -24.049 < 2e-16 ***
## temp 1.362e-01 6.104e-02 2.232 0.025611 *
## atemp 1.819e-01 5.810e-02 3.131 0.001739 **
## humidity -1.550e-01 1.244e-02 -12.459 < 2e-16 ***
## windspeed 3.686e-02 1.082e-02 3.406 0.000658 ***
## hour 7.286e-02 1.529e-03 47.643 < 2e-16 ***
## day 3.231e-03 1.798e-03 1.797 0.072310 .
## month 6.537e-02 1.222e-02 5.351 8.75e-08 ***
## year 4.794e-01 1.987e-02 24.128 < 2e-16 ***
## holiday -2.167e-02 6.298e-02 -0.344 0.730770
## workingday 6.300e-02 2.179e-02 2.892 0.003834 **
## spring 2.264e-01 8.447e-02 2.680 0.007355 **
## summer 2.691e-01 4.763e-02 5.650 1.61e-08 ***
## winter 1.560e-01 5.433e-02 2.871 0.004092 **
## clear_clouds 3.060e-01 4.084e-02 7.493 6.71e-14 ***
## mist 3.883e-01 4.136e-02 9.388 < 2e-16 ***
## lt_precip NA NA NA NA
## day_week 1.196e-02 4.936e-03 2.424 0.015356 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.2824) family taken to be 1)
##
## Null deviance: 14040.6 on 8163 degrees of freedom
## Residual deviance: 9173.3 on 8147 degrees of freedom
## AIC: 98247
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.2824
## Std. Err.: 0.0187
##
## 2 x log-likelihood: -98210.5900
The model assumes a theta of 1.28, yielding a dispersion parameter of 9173.3/8148, or 1.13, which is close to one. A negative binomial regression will adequately account for the mean-variance relationships in the data, as the theta parameter will allow the model to account for the overdispersed count variable.
As a final check, the actual count values versus each model’s predictions is plotted to compare each model’s fit. In addition, the residual versus fitted plots are also included to compare how each model is fitting the data.
resid_lm <- data.frame(act = train$count, pred = predict(fit_lm, type = "response")) %>% mutate(resid = act - pred)
lm_fit <- ggplot(resid_lm, aes(x = pred, y = act)) +
geom_point(color = "#0068D6", alpha = 0.6) +
geom_abline(slope=1, intercept=0, col = "black") +
ylim(c(0,800)) +
xlim(c(0, 800)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Actual vs Fitted Plot of a Linear Regression", y = "Actual", x = "Fitted")
lm_resid <- ggplot(resid_lm, aes(x = pred, y = resid)) +
geom_point(color = "#0068D6", alpha = 0.6) +
stat_smooth(method="lm", color = "red", formula = y ~ ns(x,6), se = FALSE) +
geom_hline(yintercept = 0) + ylim(c(-600,600)) +
xlim(c(0, 800)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Residuals vs Fitted Plot of a Linear Regression", y = "Residuals", x = "Fitted")
resid_glm <- data.frame(act = train$count, pred = predict(fit_glm, type = "response")) %>% mutate(resid = act - pred)
glm_fit <- ggplot(resid_glm, aes(x = pred, y = act)) +
geom_point(color = "#0068D6", alpha = 0.6) +
geom_abline(slope=1, intercept=0, col = "black") +
ylim(c(0,800)) +
xlim(c(0, 800)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Actual vs Fitted Plot of a Poisson Regression", y = "Actual", x = "Fitted")
glm_resid <- ggplot(resid_glm, aes(x = pred, y = resid)) +
geom_point(color = "#0068D6", alpha = 0.6) +
stat_smooth(method="lm", color = "red", formula = y ~ ns(x,6), se = FALSE) +
geom_hline(yintercept = 0) +
ylim(c(-600,600)) +
xlim(c(0, 800)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Residuals vs Fitted Plot of a Poisson Regression", y = "Residuals", x = "Fitted")
resid_nb <- data.frame(act = train$count, pred = predict(fit_nb, type = "response")) %>% mutate(resid = act - pred)
nb_fit <- ggplot(resid_nb, aes(x = pred, y = act)) +
geom_point(color = "#0068D6", alpha = 0.6) +
geom_abline(slope=1, intercept=0, col = "black") +
ylim(c(0,800)) +
xlim(c(0, 800)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Actual vs Fitted Plot of a Negative Binomial Regression", y = "Actual", x = "Fitted")
nb_resid <- ggplot(resid_nb, aes(x = pred, y = act)) +
geom_point(color = "#0068D6", alpha = 0.6) +
stat_smooth(method="lm", color = "red", formula = y ~ ns(x,6), se = FALSE) +
geom_hline(yintercept = 0) +
ylim(c(-600,600)) +
xlim(c(0, 800)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Residuals vs Fitted Plot of a Negative Binomial Regression", y = "Residuals", x = "Fitted")
grid.arrange(lm_fit, lm_resid, glm_fit, glm_resid, nb_fit, nb_resid, ncol = 2)These plots illustrate how each model type fits the mean-variance relationships in the data. With count data, the variance of predictions tends to increase the greater the amount predicted causing a fan shape to appear in the plot. The Linear Regression plots do show this fan shape as there is less variance in the residuals at lower fitted values compared to higher values. There is little bias present in the fit as the residuals are centered around zero. The heteroskedasticity in the data is not horrible, but when compared to the generalized linear regression options, the difference in variability for the glm options are greatly reduced, yielding a model that more appropriately models the variance in the data.
For instance, the Poisson regression and, especially, the negative binomial regression do not show this fan shape as their fitted values increase. The negative binomial regression’s variance modeling is very even as the fitted values increased. However, there is a clear negative bias in this model as it consistently underestimates the actual count values. It is important to remember that no model is correct, and it is often up to each modeler’s preferences to decide how important a model’s prediction bias is to its prediction variance. In this case, since these are not the final model options, the model with the best variance fit will be used going forward as issues with bias could be corrected with better feature engineering and transformations once the proper regression type is assumed. Thus, the negative binomial regression will be used for all modeling going forward.
Exploratory data analysis will be used to tune the features in the data. Thus, a list of ten data folds will be constructed using the custom klist function from my K-Fold Cross Validation Helper Functions (https://github.com/cojamalo/K-Fold-CV-Helper-Functions), which creates ten non-overlapping train-test splits of an input data set for use in using cross validation to estimate prediction error.
klist_train <- klist(train)The new features relating to the time when the bike was rented are plotted versus count to check their negative binomial fit:
registerDoMC() #doMC
plot_var <- list("hour", "day", "day_week", "month")
plots <- foreach(i = seq(from = 1, to = length(plot_var), by = 1), .packages = c("MASS"), .combine='c') %dopar%
{
form1 <- as.formula(paste0("count ~ ", plot_var[[i]]))
form2 <- as.formula(paste0("count ~ ns(", plot_var[[i]], ",2)"))
form3 <- as.formula(paste0("count ~ ns(", plot_var[[i]], ",3)"))
fit_lm <- glm.nb(form1, train); pred_lm <- predict(fit_lm, type="response")
fit_sq <- glm.nb(form2, train); pred_sq <- predict(fit_sq, type="response")
fit_cb <- glm.nb(form3, train); pred_cb <- predict(fit_cb, type="response")
plot <- ggplot(train, aes(x = train[,plot_var[[i]]], y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = train[,plot_var[[i]]], y = pred_lm), color = "blue", lwd = 1) +
geom_line(aes(x = train[,plot_var[[i]]], y = pred_sq), color = "red", lwd = 1) +
geom_line(aes(x = train[,plot_var[[i]]], y = pred_cb), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = paste0("Number of Bikes Rented by ", plot_var[[i]]), y = "Count of Bikes Rented", x = plot_var[[i]])
return(list(plot))
}
grid.arrange(plots[[1]],plots[[2]],plots[[3]], plots[[4]], ncol = 2)The blue line represents the predicted counts under the negative binomial regression. In addition, two splines are added to compare the basic fit with non-linear fits using cubic splines. In this case, splines with two (red) and three (green) degrees of freedom are shown.
hour feature splittingIt is clear from the plot that as the hour of day increases, the count of bikes varies in a way that is not predicted well by the basic negative binomial fit.
fit <- glm.nb(count ~ hour, train); pred <- predict(fit, type="response")
fit_lo <- glm.nb(count ~ hour + ns(hour, 2), train); pred_lo <- predict(fit_lo, type="response")
fit_ns <- glm.nb(count ~ hour + ns(hour, 12), train); pred_ns <- predict(fit_ns, type="response")
ggplot(train, aes(x = hour, y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = hour, y = pred), color = "blue", lwd = 1) +
geom_line(aes(x = hour, y = pred_lo), color = "red", lwd = 1) +
geom_line(aes(x = hour, y = pred_ns), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Number of Bikes Rented by Hour", y = "Count of Bikes Rented", x = "Hour")From midnight to around 4 am, bike rentals appear to decrease, then from about 5 am to around 5:30 pm, bike rentals increase, after which they again decrease. The green line in this plot shows a cubic spline with twelve degrees of freedom and how it captures the changing number of bike rentals throughout the day. While a spline could capture much of this non-linear relationship, another alternative is to split the hours into their own features so the regression can fit a model for each hour versus the rest. This is possible to do since the hour variable is discrete, not continuous.
The following code creates a dummy variable for each unique hour and fits a model using the new features.
train <- train %>% mutate(hr_0 = ifelse(hour == 0, 1, 0), hr_1 = ifelse(hour == 1, 1, 0), hr_2 = ifelse(hour == 2, 1, 0), hr_3 = ifelse(hour == 3, 1, 0), hr_4 = ifelse(hour == 4, 1, 0), hr_5 = ifelse(hour == 5, 1, 0), hr_6 = ifelse(hour == 6, 1, 0), hr_7 = ifelse(hour == 7, 1, 0), hr_8 = ifelse(hour == 8, 1, 0), hr_9 = ifelse(hour == 9, 1, 0), hr_10 = ifelse(hour == 10, 1, 0), hr_11 = ifelse(hour == 11, 1, 0), hr_12 = ifelse(hour == 12, 1, 0), hr_13 = ifelse(hour == 13, 1, 0), hr_14 = ifelse(hour == 14, 1, 0), hr_15 = ifelse(hour == 15, 1, 0), hr_16 = ifelse(hour == 16, 1, 0), hr_17 = ifelse(hour == 17, 1, 0), hr_18 = ifelse(hour == 18, 1, 0), hr_19 = ifelse(hour == 19, 1, 0), hr_20 = ifelse(hour == 20, 1, 0), hr_21 = ifelse(hour == 21, 1, 0), hr_22 = ifelse(hour == 22, 1, 0), hr_23 = ifelse(hour == 23, 1, 0))
fit_sep <- glm.nb(count ~ hour + hr_0 + hr_1 + hr_2 + hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 + hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 + hr_19 + hr_20 + hr_21 + hr_22 + hr_23, train)The residual vs fitted plots give a quick comparison between the fits of the original model with just the hour and the hour separated model.
plot(fit, which =1)plot(fit_sep, which =1)The bias and variance issues caused by the non-linear hour-count relationship have been reduced significantly with the separation of the hour variable.
day feature checkNext, the day feature is checked.
fit <- glm.nb(count ~ day, train); pred <- predict(fit, type="response")
fit_lo <- glm.nb(count ~ ns(day, 2), train); pred_lo <- predict(fit_lo, type="response")
fit_ns <- glm.nb(count ~ ns(day, 12), train); pred_ns <- predict(fit_ns, type="response")
ggplot(train, aes(x = day, y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = day, y = pred), color = "blue", lwd = 1) +
geom_line(aes(x = day, y = pred_lo), color = "red", lwd = 1) +
geom_line(aes(x = day, y = pred_ns), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Number of Bikes Rented by Day of the Month", y = "Count of Bikes Rented", x = "Day of the Month")The relationship between day of the month and the number of bikes rented is very weak. This feature will remain untouched.
plot(fit, which = 1)day_week feature checkThe day_week feature is checked.
fit <- glm.nb(count ~ day_week, train); pred <- predict(fit, type="response")
fit_lo <- glm.nb(count ~ ns(day_week, 2), train); pred_lo <- predict(fit_lo, type="response")
fit_ns <- glm.nb(count ~ ns(day_week, 6), train); pred_ns <- predict(fit_ns, type="response")
ggplot(train, aes(x = day_week, y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = day_week, y = pred), color = "blue", lwd = 1) +
geom_line(aes(x = day_week, y = pred_lo), color = "red", lwd = 1) +
geom_line(aes(x = day_week, y = pred_ns), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Number of Bikes Rented by Day of the Week", y = "Count of Bikes Rented", x = "Day of the Week")The relationship between day_week of the month and the number of bikes rented is very weak. This feature will remain untouched.
plot(fit, which = 1)month feature checkFinally, the month feature is checked to ensure it is ready for modeling.
fit <- glm.nb(count ~ month, train); pred <- predict(fit, type="response")
fit_lo <- glm.nb(count ~ month + ns(month, 2), train); pred_lo <- predict(fit_lo, type="response")
fit_ns <- glm.nb(count ~ month + ns(month, 2), train); pred_ns <- predict(fit_ns, type="response")
ggplot(train, aes(x = month, y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = month, y = pred), color = "blue", lwd = 1) +
geom_line(aes(x = month, y = pred_lo), color = "red", lwd = 1) +
geom_line(aes(x = month, y = pred_ns), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Number of Bikes Rented by Month", y = "Count of Bikes Rented", x = "Month")From this plot, it appears there is a relationship between which month bikes are rented in and how many bikes are rented. The spline with two degrees of freedom captures this relationship well as more bikes are rented in the summer months than the winter months.
To ensure the spline with two degrees of freedom does result in the best predictions, an optimization check for degrees of freedom is conducted. Splines of differing degrees of freedom as fitted and a root mean squared error (RMSE) estimate is taken for the resulting model using a 10-Fold Cross Validation for the training data.
registerDoMC() #doMC
rmses <- foreach(i = seq(from = 1, to = 6, by = 1), .packages = c("MASS"), .combine='c') %dopar%
{
eval <- colMeans(validate_glm_nb(klist_train, count ~ month + ns(month, i)))
return(eval)
}
output <- data.frame(df = seq(from = 1, to = 6, by = 1), KFOLD_RMSE = NA)
output$KFOLD_RMSE <- rmses
pandoc.table(output)| df | KFOLD_RMSE |
|---|---|
| 1 | 179.4 |
| 2 | 174.7 |
| 3 | 174.6 |
| 4 | 174.5 |
| 5 | 174.5 |
| 6 | 174.5 |
ggplot(output, aes(x = df, y = KFOLD_RMSE)) + geom_point() +
geom_vline(xintercept = 2) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "10-FOLD RMSE Estimate by df of `month` Spline", y = "10-FOLD RMSE Estimate", x = "Degrees of Freedom of Spline")There is clearly a large reduction in prediction error for a spline with two degrees of freedom. Any additional degrees of freedom do not significantly reduce error, so a two df spline is selected.
The residual versus fitted plots are compared to view the results of the added spline:
plot(fit, which =1)plot(fit_ns, which =1)The model including the 2 df spline has less variation in bias than the model excluding the spline, so the spline will be included in future models.
Now that all the discrete and factor variables have been addressed, the remaining continuous variables must be checked.
registerDoMC() #doMC
plot_var <- list("temp", "atemp", "humidity", "windspeed")
plots <- foreach(i = seq(from = 1, to = 4, by = 1), .packages = c("MASS"), .combine='c') %dopar%
{
form1 <- as.formula(paste0("count ~ ", plot_var[[i]]))
form2 <- as.formula(paste0("count ~ ns(", plot_var[[i]], ",2)"))
form3 <- as.formula(paste0("count ~ ns(", plot_var[[i]], ",3)"))
fit_lm <- glm.nb(form1, train); pred_lm <- predict(fit_lm, type="response")
fit_sq <- glm.nb(form2, train); pred_sq <- predict(fit_sq, type="response")
fit_cb <- glm.nb(form3, train); pred_cb <- predict(fit_cb, type="response")
plot <- ggplot(train, aes(x = train[,plot_var[[i]]], y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = train[,plot_var[[i]]], y = pred_lm), color = "blue", lwd = 1) +
geom_line(aes(x = train[,plot_var[[i]]], y = pred_sq), color = "red", lwd = 1) +
geom_line(aes(x = train[,plot_var[[i]]], y = pred_cb), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = paste0("Number of Bikes Rented by ", plot_var[[i]]), y = "Count of Bikes Rented", x = plot_var[[i]])
return(list(plot))
}
grid.arrange(plots[[1]],plots[[2]],plots[[3]],plots[[4]], ncol = 2) The blue line represents the predicted counts under the negative binomial regression. In addition, two splines are added to compare the basic fit with non-linear fits using cubic splines. In this case, splines with two (red) and three (green) degrees of freedom are shown.
temp feature checkThe first variable to check is temp.
fit <- glm.nb(count ~ temp, train); pred <- predict(fit, type="response")
fit_lo <- glm.nb(count ~ temp + ns(temp, 2), train); pred_lo <- predict(fit_lo, type="response")
fit_ns <- glm.nb(count ~ temp + ns(temp, 3), train); pred_ns <- predict(fit_ns, type="response")
scatter <- ggplot(train, aes(x = temp, y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = temp, y = pred), color = "blue", lwd = 1) +
geom_line(aes(x = temp, y = pred_lo), color = "red", lwd = 1) +
geom_line(aes(x = temp, y = pred_ns), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Number of Bikes Rented by Scaled Temperature", y = "Count of Bikes Rented", x = "Scaled Temperature")
density <- ggplot(train, aes(x = temp)) +
geom_density(fill = "black",color = "black", alpha = 0.6) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Density of Scaled Temperature", y = "P(Scaled Temperature)", x = "Scaled Temperature")
grid.arrange(scatter, density, ncol = 2)A scatterplot and density plot are included to evaluate the relationship between temp and count. In this case, the simple negative binomial regression appears to fit the data well.
A quick check of the residual versus fitted plot confirms this assessment as the residual variance and bias is acceptable.
plot(fit, which = 1)atemp feature checkHypothetically, atemp should have a similar relationship to count as temp.
fit <- glm.nb(count ~ atemp, train); pred <- predict(fit, type="response")
fit_lo <- glm.nb(count ~ ns(atemp, 2), train); pred_lo <- predict(fit_lo, type="response")
fit_ns <- glm.nb(count ~ ns(atemp, 7), train); pred_ns <- predict(fit_ns, type="response")
scatter <- ggplot(train, aes(x = atemp, y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = atemp, y = pred), color = "blue", lwd = 1) +
geom_line(aes(x = atemp, y = pred_lo), color = "red", lwd = 1) +
geom_line(aes(x = atemp, y = pred_ns), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Number of Bikes Rented by Scaled Aparent Temp", y = "Count of Bikes Rented", x = "Scaled Aparent Temp")
density <- ggplot(train, aes(x = atemp)) +
geom_density(fill = "black",color = "black", alpha = 0.6) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Density of Scaled Aparent Temperature", y = "P(Scaled Aparent Temp)", x = "Scaled Aparent Temp")
grid.arrange(scatter, density,ncol = 2)Again, the simple negative binomial regression appears to fit best.
A quick check of the residual versus fitted plot confirms this assessment as the residual variance and bias is acceptable.
plot(fit, which = 1)humidity feature checkThe next feature variable is humidity.
fit <- glm.nb(count ~ humidity, train); pred <- predict(fit, type="response")
fit_lo <- glm.nb(count ~ humidity + ns(humidity, 2), train); pred_lo <- predict(fit_lo, type="response")
fit_ns <- glm.nb(count ~ humidity + ns(humidity, 3), train); pred_ns <- predict(fit_ns, type="response")
scatter <- ggplot(train, aes(x = humidity, y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = humidity, y = pred), color = "blue", lwd = 1) +
geom_line(aes(x = humidity, y = pred_lo), color = "red", lwd = 1) +
geom_line(aes(x = humidity, y = pred_ns), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Number of Bikes Rented by Scaled Humidity", y = "Count of Bikes Rented", x = "Scaled Humidity")
density <- ggplot(train, aes(x = humidity)) +
geom_density(fill = "black",color = "black", alpha = 0.6) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Density of Scaled Humidity", y = "P(Scaled Humidity)", x = "Scaled Humidity")
grid.arrange(scatter, density,ncol = 2)It is unclear why there is a grouping of low counts at the minimum humidity values. Otherwise, the 2 df spline appears to fit the data well.
To ensure the spline with two degrees of freedom does result in the best predictions, an optimization check for degrees of freedom is conducted. Splines of differing degrees of freedom as fitted and a root mean squared error (RMSE) estimate is taken for the resulting model using a 10-Fold Cross Validation for the training data.
registerDoMC() #doMC
rmses <- foreach(i = seq(from = 1, to = 6, by = 1), .packages = c("MASS"), .combine='c') %dopar%
{
eval <- colMeans(validate_glm_nb(klist_train, count ~ ns(humidity, i)))
return(eval)
}
output <- data.frame(df = seq(from = 1, to = 6, by = 1), KFOLD_RMSE = NA)
output$KFOLD_RMSE <- rmses
pandoc.table(output)| df | KFOLD_RMSE |
|---|---|
| 1 | 172.9 |
| 2 | 171.5 |
| 3 | 171.6 |
| 4 | 171.3 |
| 5 | 171.1 |
| 6 | 171 |
ggplot(output, aes(x = df, y = KFOLD_RMSE)) +
geom_point() +
geom_vline(xintercept = 2) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "10-FOLD RMSE Estimate by df of `humidity` Spline", y = "10-FOLD RMSE Estimate", x = "Degrees of Freedom of Spline")There is a significant improvement in RMSE with the 2 df spline included.
Finally, the residual vs fitted plot is checked.
plot(fit, which = 1)plot(fit_lo, which = 1)The residuals for the 2 df spline exhibit better bias and variance than the original fit.
windspeed feature tuning with splinesFinally, the windspeed variable is checked.
fit <- glm.nb(count ~ windspeed, train); pred <- predict(fit, type="response")
fit_lo <- glm.nb(count ~ windspeed + ns(windspeed, 2), train); pred_lo <- predict(fit_lo, type="response")
fit_ns <- glm.nb(count ~ windspeed + ns(windspeed, 3), train); pred_ns <- predict(fit_ns, type="response")
scatter <- ggplot(train, aes(x = windspeed, y = count)) +
geom_point(alpha = 0.2) +
geom_line(aes(x = windspeed, y = pred), color = "blue", lwd = 1) +
geom_line(aes(x = windspeed, y = pred_lo), color = "red", lwd = 1) +
geom_line(aes(x = windspeed, y = pred_ns), color = "green", lwd = 1) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Number of Bikes Rented by Scaled Windspeed", y = "Count of Bikes Rented", x = "Scaled Windspeed")
density <- ggplot(train, aes(x = windspeed)) +
geom_density(fill = "black",color = "black", alpha = 0.6) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Density of Scaled Windspeed", y = "P(Scaled Windspeed)", x = "Scaled Windspeed")
grid.arrange(scatter, density,ncol = 2)Again, the 2 df spline appears to fit the data well.
To ensure the spline with two degrees of freedom does result in the best predictions, an optimization check for degrees of freedom is conducted. Splines of differing degrees of freedom as fitted and a root mean squared error (RMSE) estimate is taken for the resulting model using a 10-Fold Cross Validation for the training data.
registerDoMC() #doMC
rmses <- foreach(i = seq(from = 1, to = 6, by = 1), .packages = c("MASS"), .combine='c') %dopar%
{
eval <- colMeans(validate_glm_nb(klist_train, count ~ windspeed + ns(windspeed, i)))
return(eval)
}
output <- data.frame(df = seq(from = 1, to = 6, by = 1), KFOLD_RMSE = NA)
output$KFOLD_RMSE <- rmses
pandoc.table(output)| df | KFOLD_RMSE |
|---|---|
| 1 | 180.4 |
| 2 | 179.9 |
| 3 | 179.9 |
| 4 | 179.7 |
| 5 | 179.7 |
| 6 | 179.7 |
ggplot(output, aes(x = df, y = KFOLD_RMSE)) +
geom_point() +
geom_vline(xintercept = 2) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "10-FOLD RMSE Estimate by df of `windspeed` Spline", y = "10-FOLD RMSE Estimate", x = "Degrees of Freedom of Spline")There is a significant improvement in RMSE with the 2 df spline included.
Finally, the residual vs fitted plot is checked.
klist_train <- klist(train)
plot(fit, which = 1)plot(fit_lo, which = 1)The residuals for the 2 df spline exhibit better bias and variance than the original fit.
Before selecting for the best features, all changes made to the training data set are applied to the testing data set.
# Apply all variable transformations to the test set
test <- test %>% select(-registered, -casual)
test$datetime <- ymd_hms(test$datetime)
test <- test %>% mutate(spring = ifelse(season == 1, 1, 0), summer = ifelse(season == 2, 1, 0), fall = ifelse(season == 3, 1, 0), winter = ifelse(season == 4, 1, 0)) %>% select(-season)
test <- test %>% mutate(clear_clouds = ifelse(weather == 1, 1, 0), mist = ifelse(weather == 2, 1, 0), lt_precip = ifelse(weather == 3 | weather == 4, 1, 0)) %>% select(-weather)
test <- test %>% mutate(hour = hour(datetime), day = day(datetime), month = month(datetime), year = year(datetime), weekendday = ifelse(wday(datetime) %in% c(1, 7), 1, 0), day_week = wday(datetime)) %>% select(-datetime)
test <- test %>% mutate(hr_0 = ifelse(hour == 0, 1, 0), hr_1 = ifelse(hour == 1, 1, 0), hr_2 = ifelse(hour == 2, 1, 0), hr_3 = ifelse(hour == 3, 1, 0), hr_4 = ifelse(hour == 4, 1, 0), hr_5 = ifelse(hour == 5, 1, 0), hr_6 = ifelse(hour == 6, 1, 0), hr_7 = ifelse(hour == 7, 1, 0), hr_8 = ifelse(hour == 8, 1, 0), hr_9 = ifelse(hour == 9, 1, 0), hr_10 = ifelse(hour == 10, 1, 0), hr_11 = ifelse(hour == 11, 1, 0), hr_12 = ifelse(hour == 12, 1, 0), hr_13 = ifelse(hour == 13, 1, 0), hr_14 = ifelse(hour == 14, 1, 0), hr_15 = ifelse(hour == 15, 1, 0), hr_16 = ifelse(hour == 16, 1, 0), hr_17 = ifelse(hour == 17, 1, 0), hr_18 = ifelse(hour == 18, 1, 0), hr_19 = ifelse(hour == 19, 1, 0), hr_20 = ifelse(hour == 20, 1, 0), hr_21 = ifelse(hour == 21, 1, 0), hr_22 = ifelse(hour == 22, 1, 0), hr_23 = ifelse(hour == 23, 1, 0))
test <- test %>% mutate(temp = scale(temp, attr(temp_train, "scaled:center"),attr(temp_train, "scaled:scale"))[,1], atemp = scale(atemp, attr(atemp_train, "scaled:center"),attr(atemp_train, "scaled:scale"))[,1], humidity = scale(humidity, attr(humidity_train, "scaled:center"),attr(humidity_train, "scaled:scale"))[,1], windspeed = scale(windspeed, attr(windspeed_train, "scaled:center"),attr(windspeed_train, "scaled:scale"))[,1])First, the training data set is checked for any variables that have near zero variance.
names(train)[nearZeroVar(train)]## [1] "holiday" "hr_0" "hr_1" "hr_2" "hr_3" "hr_4" "hr_5"
## [8] "hr_6" "hr_7" "hr_8" "hr_9" "hr_10" "hr_11" "hr_12"
## [15] "hr_13" "hr_14" "hr_15" "hr_16" "hr_17" "hr_18" "hr_19"
## [22] "hr_20" "hr_21" "hr_22" "hr_23"
table(train$holiday)##
## 0 1
## 7941 223
The holiday variable and individual hour variables have low variance, but they each have enough representation in each category to keep them included in the data set.
The first model fit is the negative binomial model with all features included.
fit1 <- glm.nb(count ~ . + ns(month, 2) + ns(humidity, 2) + ns(windspeed,2), train)
summary(fit1)##
## Call:
## glm.nb(formula = count ~ . + ns(month, 2) + ns(humidity, 2) +
## ns(windspeed, 2), data = train, init.theta = 3.966916422,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.6086 -0.7125 -0.0747 0.4955 4.7092
##
## Coefficients: (8 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.773e+02 2.362e+01 -41.370 < 2e-16 ***
## holiday -1.303e-01 3.692e-02 -3.528 0.000418 ***
## workingday -2.169e-01 1.276e-02 -17.000 < 2e-16 ***
## temp 8.774e-02 3.699e-02 2.372 0.017699 *
## atemp 1.217e-01 3.403e-02 3.576 0.000349 ***
## humidity -5.862e-02 8.142e-03 -7.200 6.04e-13 ***
## windspeed -3.389e-02 1.516e-02 -2.235 0.025418 *
## spring 9.865e-03 6.606e-02 0.149 0.881284
## summer 7.641e-02 4.963e-02 1.539 0.123687
## fall -1.243e-01 3.453e-02 -3.599 0.000319 ***
## winter NA NA NA NA
## clear_clouds 4.281e-01 2.530e-02 16.918 < 2e-16 ***
## mist 3.819e-01 2.522e-02 15.144 < 2e-16 ***
## lt_precip NA NA NA NA
## hour -3.695e-01 3.972e-02 -9.304 < 2e-16 ***
## day 2.861e-03 1.055e-03 2.711 0.006704 **
## month 3.811e-02 7.854e-03 4.852 1.22e-06 ***
## year 4.916e-01 1.173e-02 41.901 < 2e-16 ***
## weekendday NA NA NA NA
## day_week 1.439e-02 2.896e-03 4.970 6.70e-07 ***
## hr_0 -9.012e+00 8.942e-01 -10.078 < 2e-16 ***
## hr_1 -9.171e+00 8.546e-01 -10.731 < 2e-16 ***
## hr_2 -9.132e+00 8.150e-01 -11.205 < 2e-16 ***
## hr_3 -9.404e+00 7.754e-01 -12.128 < 2e-16 ***
## hr_4 -9.605e+00 7.359e-01 -13.052 < 2e-16 ***
## hr_5 -8.072e+00 6.960e-01 -11.598 < 2e-16 ***
## hr_6 -6.310e+00 6.563e-01 -9.615 < 2e-16 ***
## hr_7 -4.935e+00 6.166e-01 -8.004 1.21e-15 ***
## hr_8 -4.034e+00 5.768e-01 -6.993 2.70e-12 ***
## hr_9 -4.220e+00 5.371e-01 -7.856 3.96e-15 ***
## hr_10 -4.170e+00 4.975e-01 -8.382 < 2e-16 ***
## hr_11 -3.691e+00 4.578e-01 -8.063 7.47e-16 ***
## hr_12 -3.114e+00 4.182e-01 -7.447 9.58e-14 ***
## hr_13 -2.754e+00 3.786e-01 -7.273 3.51e-13 ***
## hr_14 -2.454e+00 3.391e-01 -7.236 4.61e-13 ***
## hr_15 -2.034e+00 2.996e-01 -6.789 1.13e-11 ***
## hr_16 -1.432e+00 2.602e-01 -5.502 3.75e-08 ***
## hr_17 -6.114e-01 2.209e-01 -2.768 0.005643 **
## hr_18 -3.181e-01 1.818e-01 -1.750 0.080121 .
## hr_19 -2.862e-01 1.429e-01 -2.003 0.045155 *
## hr_20 -2.165e-01 1.047e-01 -2.067 0.038716 *
## hr_21 -9.893e-02 6.814e-02 -1.452 0.146493
## hr_22 NA NA NA NA
## hr_23 NA NA NA NA
## ns(month, 2)1 7.482e-01 1.140e-01 6.563 5.26e-11 ***
## ns(month, 2)2 NA NA NA NA
## ns(humidity, 2)1 6.790e-01 9.517e-02 7.134 9.73e-13 ***
## ns(humidity, 2)2 NA NA NA NA
## ns(windspeed, 2)1 6.800e-02 8.810e-02 0.772 0.440197
## ns(windspeed, 2)2 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.9669) family taken to be 1)
##
## Null deviance: 41241 on 8163 degrees of freedom
## Residual deviance: 8756 on 8122 degrees of freedom
## AIC: 88561
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.9669
## Std. Err.: 0.0666
##
## 2 x log-likelihood: -88474.5210
To assess the prediction error of this model, the RMSE is calculated using the custom cross validation functions from my K-Fold Cross Validation Helper Functions, which creates ten non-overlapping train-test splits of an input data set for use in using cross validation to estimate prediction error.
rmse_basic <- colMeans(validate_glm_nb(klist_train, count ~ . + ns(month, 2) + ns(humidity, 2) + ns(windspeed,2))) # 100This model has an estimated RMSE of 100 bikes.
The first feature selection strategy is used is a greedy forward-step selection where the features are added one at a time and the feature that lowers the RMSE the most at each step is selected and the then the process is rerun until adding more features no longer lowers the RMSE.
test_val <- function(formula) {
fit <- glm.nb(as.formula(formula), train)
pred <- predict(fit, test, type = "response")
resid <- test$count - pred
return(sqrt(mean(resid^2)))
}
one_step <- function(base_function, data) {
# RMSE
RMSE <- colMeans(validate_glm_nb(klist_train, as.formula(base_function)))
# Extract variables from the formula
params <- trimws(strsplit(base_function, split = "~")[[1]])
# Construct base vector from the formula
base <- params[2]
if (grepl("[+]", params[2])) {
base <- trimws(strsplit(params[2], split = "[+]")[[1]])
}
predictors <- length(base)
test_RMSE <- test_val(base_function)
output <- data.frame(model = base_function, terms = predictors, RMSE = RMSE, test_RMSE = test_RMSE)
# extract response variable
variables <- names(data)
variables <- variables[!grepl(params[1], variables)]
variables <- c(variables, "ns(month, 2)", "ns(humidity, 2)", "ns(windspeed,2)")
# Construct expand_args from base
expand_args <- c()
for (item in base) {
expand_args <- c(expand_args, list(item))
}
expand_args <- c(expand_args, list(variables))
# Construct the combination data frame
current <- expand.grid(expand_args, stringsAsFactors = FALSE)
# This checks for any rows that are duplicates of each other:
duplicates <- duplicated(apply(apply( current, 1, sort), 2 , paste , collapse = ""))
current <- current[!duplicates,]
# This removes any rows with the same variable in more than one column
key <- colSums(apply(apply( current, 1, as.character), 2 , duplicated))
current <- current[key == 0,]
# Construct formulas vector
formulas <- apply(current, 1 , paste , collapse = " + ")
formulas <- paste0(params[1], " ~ ", formulas)
# Runs validate for each string of a vector of formula strings
registerDoMC() #doMC
new <- foreach(i = seq(from = 1, to = length(formulas), by = 1), .packages = c("MASS"), .combine='rbind') %dopar%
{
RMSE <- colMeans(validate_glm_nb(klist_train, as.formula(formulas[i])))
# Extract variables from the formula
params <- trimws(strsplit(formulas[i], split = "~")[[1]])
# Construct base vector from the formula
base <- params[2]
if (grepl("[+]", params[2])) {
base <- trimws(strsplit(params[2], split = "[+]")[[1]])
}
predictors <- length(base)
test_RMSE <- test_val(formulas[i])
output <- data.frame(model = formulas[i], terms = predictors, RMSE = RMSE, test_RMSE = test_RMSE)
return(output[1,])
}
output <- rbind(output, new) %>% arrange(RMSE)
return(output)
}
previous_best_formula <- "count ~ 1"
run <- one_step(previous_best_formula, train)
runs <- run
new_score <- run$RMSE - 1
best_score <- run$RMSE
while (new_score < best_score) {
best_score <- new_score
best_formula <- run$model
run <- one_step(as.character(best_formula), train)
new_score <- run$RMSE
runs <- rbind(runs, run)
}
results <- runs %>% group_by(terms) %>% top_n(1) %>% arrange(RMSE) %>% select(-test_RMSE)
pandoc.table(results)| model | terms | RMSE |
|---|---|---|
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + day + day_week + mist + workingday | 30 | 98.85100 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + day + day_week + workingday | 29 | 98.92604 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + day + day_week + mist + hr_14 + workingday | 31 | 99.00300 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + day + weekendday | 28 | 99.10054 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + weekendday | 27 | 99.30835 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + weekendday | 26 | 100.55224 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + weekendday | 24 | 100.65066 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + weekendday | 25 | 100.66310 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + weekendday | 23 | 101.46865 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + weekendday | 22 | 101.59723 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + weekendday | 21 | 102.70201 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + weekendday | 20 | 104.08084 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + workingday | 19 | 105.16557 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + workingday | 18 | 107.90414 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + workingday | 17 | 111.58021 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + workingday | 16 | 113.03645 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + workingday | 15 | 114.34806 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + workingday | 14 | 116.45280 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + weekendday | 13 | 118.58824 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + weekendday | 12 | 123.43176 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + workingday | 11 | 124.60806 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_9 | 10 | 127.29737 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_9 | 9 | 131.58856 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hour | 8 | 135.94216 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + hour | 7 | 140.00588 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + hour | 6 | 146.67190 |
| count ~ 1 + atemp + hr_17 + hr_18 + hour | 5 | 153.72401 |
| count ~ 1 + atemp + hr_17 + hour | 4 | 161.07305 |
| count ~ 1 + atemp + hour | 3 | 171.59882 |
| count ~ 1 | 1 | 181.28396 |
| count ~ 1 + hour | 2 | 192.81263 |
The formula with 30 terms appears to lower the RMSE the most, but after about 22 terms, the RMSE drops much more slowly.
This relationship can be plotted with a scatterplot of RMSE versus the best formula for each number of features:
ggplot(results, aes(x = terms, y = RMSE)) +
geom_point() +
geom_vline(xintercept = 30, color = "red") +
geom_vline(xintercept = 22, color = "blue") +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "RMSE Estimate by Number of Features", y = "RMSE Estimate", x = "Number of features in formula")An elbow of this plot at 22 is marked with a blue vertical line and the model with the best RMSE is marked with a red vertical line.
To avoid overfitting, both models will be considered as overfitting often occurs if only the best RMSE estimate for the training data is chosen.
rmse_greedy_error <- colMeans(validate_glm_nb(klist_train, count ~ atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + day + day_week + mist + workingday)) # 98.9
rmse_greedy_elbow <- colMeans(validate_glm_nb(klist_train, count ~ atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + weekendday)) # 102These models have an estimated RMSE of 102 bikes for the elbow model with 22 features and 98.9 for the best model with 30 features.
Finally, both models are fit to compare later.
fit_gr_elb <- glm.nb(count ~ atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + weekendday, train)
fit_gr <- glm.nb(count ~ atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + day + day_week + mist + workingday, train)The final modeling technique included in this analysis is to preprocess the data with principal components analysis (PCA). PCA requires a threshold value for the amount of variability to capture. Thus, an estimation of RMSE will be used to find the best threshold.
train_pca_dat <- cbind(train, ns(train$month, 2)[,],ns(train$humidity, 2)[,], ns(train$windspeed, 2)[,])
registerDoMC()
rmses <- foreach(i = seq(from = 0.90, to = 0.99, by = 0.01), .packages = c("caret", "MASS"), .combine='c') %dopar%
{
# Create preProcess object using pca method with threshold for variance
train_pca <- preProcess(train_pca_dat, method = "pca", thresh = i)
# Create the new dataset with pca variables from the preProcess object
train_pca <- predict(train_pca, train_pca_dat)
# Add the data you want to predict from training set
train_pca$count <- train_pca_dat$count
klist_pca <- klist(train_pca)
eval_pca <- colMeans(validate_glm_nb(klist_pca, count ~ .))
return(eval_pca)
}
output <- data.frame(thresh = seq(from = 0.90, to = 0.99, by = 0.01), KFOLD_RMSE = NA)
output$KFOLD_RMSE <- rmses
pandoc.table(output)| thresh | KFOLD_RMSE |
|---|---|
| 0.9 | 114.4 |
| 0.91 | 114.4 |
| 0.92 | 105.7 |
| 0.93 | 105.7 |
| 0.94 | 105.7 |
| 0.95 | 105.4 |
| 0.96 | 105.4 |
| 0.97 | 98.09 |
| 0.98 | 98.55 |
| 0.99 | 98.96 |
ggplot(output, aes(x = thresh, y = KFOLD_RMSE)) +
geom_point() +
geom_vline(xintercept = 0.97) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "RMSE Estimate by PCA Threshold", y = "RMSE Estimate", x = "PCA Threshold")A threshold of 0.97 yields the lowest RMSE value.
The final PCA is run and RMSE estimated.
# Create preProcess object using pca method
train_pca_r <- preProcess(train_pca_dat, method = "pca", thresh = 0.97)
# Create the new dataset with pca variables from the preProcess object
train_pca <- predict(train_pca_r, train_pca_dat)
# Add the data you want to predict from training set
train_pca$count <- train_pca_dat$count
klist_pca <- klist(train_pca)
rmse_pca_glm <- colMeans(validate_glm_nb(klist_pca, count ~ ., parallel = TRUE)) #98.1This model has an estimated RMSE of 98.1 bikes.
The PCA model is fit:
fit_pca_glm <- glm.nb(count ~ ., train_pca)The compare the four models to select the best option, the models will be validated by the test samples and the RMSE calculated for the out-of-sample predictions.
## Basic
pred1 <- predict(fit1, test, type = "response")
resid1 <- test$count - pred1
rmse_test_basic <- sqrt(mean(resid1^2))
## PCA glm
test_pca_dat <- cbind(test, ns(test$month, 2)[,], ns(test$humidity, 2)[,], ns(test$windspeed, 2)[,])
test_pca <- predict(train_pca_r, test_pca_dat)
test_pca$count <- test$count
pred2 <- predict(fit_pca_glm, test_pca, type = "response")
resid2 <- test$count - pred2
rmse_test_pca_glm <- sqrt(mean(resid2^2))
## Greedy Elbow
pred3 <- predict(fit_gr_elb, test, type = "response")
resid3 <- test$count - pred3
rmse_test_gr_elb <- sqrt(mean(resid3^2))
## Greedy Error
pred4 <- predict(fit_gr, test, type = "response")
resid4 <- test$count - pred4
rmse_test_gr <- sqrt(mean(resid4^2))The top two models for out-of-sample RMSE are the PCA model with an RMSE of 96.89 bikes and the best model from the greedy forward-step selection process with an RMSE of 100.05 bikes.
Now that the models are being compared according to out-of-sample RMSEs, the results from the greedy forward-step analysis with the out-of-sample RMSE can be considered.
results <- runs %>% group_by(terms) %>% top_n(1) %>% mutate(mean_RMSE = (RMSE + test_RMSE) / 2) %>% arrange(mean_RMSE)
pandoc.table(head(results, 2))| model | terms | RMSE | test_RMSE | mean_RMSE |
|---|---|---|---|---|
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + day + day_week + workingday | 29 | 98.92604 | 99.95026 | 99.43815 |
| count ~ 1 + atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + day + day_week + mist + workingday | 30 | 98.85100 | 100.04631 | 99.44866 |
Now, the model with 29 features has the lowest average within-sample and out-of-sample RMSE values.
The relationship between the number of features and the out-of-sample RMSE is plotted below:
ggplot(results, aes(x = terms, y = RMSE)) +
geom_point() +
geom_point(aes(y = test_RMSE), color = "blue") +
geom_vline(xintercept = 29, color = "red") +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "RMSE Estimate by Number of Features", y = "RMSE Estimate", x = "Number of features")fit_gr <- glm.nb(count ~ atemp + hr_17 + hr_18 + year + ns(humidity, 2) + winter + hr_8 + hr_4 + hr_3 + hr_2 + hr_5 + hr_1 + hr_0 + hr_19 + spring + hr_6 + hr_23 + hr_16 + hr_22 + lt_precip + hr_10 + hr_21 + hr_11 + ns(month, 2) + summer + day + day_week + workingday, train)
pred4 <- predict(fit_gr, test, type = "response")
resid4 <- test$count - pred4
rmse_test_gr <- sqrt(mean(resid4^2))The blue dots represent the out-of-sample RMSE and the black dot is the within-sample RMSE. At 29 features the two errors are the lowest on average. Above 29 features, the two errors begin to diverge.
To make the final model selection, the actual count values versus each model’s predictions are plotted to compare each model’s fit. In addition, the residual versus fitted plots are also included to compare how each model is fitting the data.
pred_train_pca <- predict(fit_pca_glm, type = "response")
resid_train_pca <- train$count - pred_train_pca
pca_fit <- ggplot(test, aes(x = pred2, y = count)) +
geom_point(data=train, aes(x = pred_train_pca, y = count), alpha = 0.4, color = "#00BFC4") +
geom_point(col = "#F8766D", alpha = 0.4) +
geom_abline(slope=1, intercept=0, col = "black") +
ylim(c(0,1000)) +
xlim(c(0,1500)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Actual vs Fitted Plot of the PCA Model", y = "Actual", x = "Fitted")
pca_resid <- ggplot(test, aes(x = pred2, y = resid2)) +
geom_point(data = train, aes(x = pred_train_pca, y = resid_train_pca), color = "#00BFC4", alpha = 0.4) +
stat_smooth(method="lm", color = "red", formula = y ~ ns(x,6), se = FALSE) +
geom_point(col = "#F8766D", alpha = 0.4) +
geom_hline(yintercept = 0) +
ylim(c(-600,600)) +
xlim(c(0, 800)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Residuals vs Fitted Plot of the PCA Model", y = "Residuals", x = "Fitted")
pred_train <- predict(fit_gr, type = "response")
resid_train <- train$count - pred_train
gr_fit <- ggplot(test, aes(x = pred4, y = count)) +
geom_point(data= train, aes(x = pred_train, y = count), alpha = 0.4, color = "#00BFC4") +
geom_point(col = "#F8766D", alpha = 0.4) +
geom_abline(slope=1, intercept=0) +
ylim(c(0,1000)) +
xlim(c(0,1500)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Actual vs Fitted Plot of Best Greedy Model", y = "Actual", x = "Fitted")
gr_resid <- ggplot(test, aes(x = pred4, y = resid4)) +
geom_point(data = train, aes(x = pred_train, y = resid_train), color = "#00BFC4", alpha = 0.4) +
stat_smooth(method="lm", color = "red", formula = y ~ ns(x,6), se = FALSE) +
geom_point(col = "#F8766D", alpha = 0.4) + geom_hline(yintercept = 0) +
ylim(c(-600,600)) +
xlim(c(0, 800)) +
theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
labs(title = "Residuals vs Fitted Plot of Best Greedy Model", y = "Residuals", x = "Fitted")
grid.arrange(pca_fit, pca_resid, gr_fit, gr_resid, ncol = 2)In this plot, the within-sample predictions are plotted in turquoise and the out-of-sample predictions are plotted in red. Both models have similar distributions of residuals for both prediction types signifying that the models having external validity for use in predicting new data.
In terms of each plots bias and variance, the PCA model has superior bias and variance. The residuals for the PCA model have more even variance as the value of the predicted counts increases. Moreover, for predicted counts of less than 500 bikes, the model has little bias. However, above 500 predicted bikes, the model tends to overestimate the number of bikes rented, with consistent variance. The greedy model, on the other hand, is very heteroscedastic and variance increases greatly as the predicted counts increases.
To get a sense of the residual distribution of both models, summary statistic as calculated for the residuals:
## PCA
test_pca_dat <- cbind(test, ns(test$hour, 12)[,], ns(test$month, 2)[,], ns(test$temp, 7)[,], ns(test$atemp, 7)[,],ns(test$humidity, 2)[,], ns(test$windspeed, 2)[,])
test_pca <- predict(train_pca_r, test_pca_dat)
test_pca$count <- test$count
pred3 <- predict(fit_pca_glm, test_pca, type = "response")
resid3 <- test$count - pred3
rmse_test_pca <- sqrt(mean(resid3^2))
# Summary table of n, mean, and variance by factor group
sum_table <- as.data.frame(resid3) %>% summarize(Q1 = quantile(resid3, 0.25), MEDIAN = median(resid3), MEAN = mean(resid3), Q3 = quantile(resid3, 0.75), IQR = IQR(resid3), STDEV = sd(resid3)) %>%
mutate(SKEW = ifelse(MEAN > MEDIAN, "RIGHT", "LEFT"))
pandoc.table(sum_table)| Q1 | MEDIAN | MEAN | Q3 | IQR | STDEV | SKEW |
|---|---|---|---|---|---|---|
| -14.73 | 2.6 | -0.9179 | 35.2 | 49.93 | 86.08 | LEFT |
For the PCA model, the average error is -0.92 bikes with a standard deviation of 86 bikes.
For contrast, the greedy model’s summary statistics are calculated.
# Summary table of n, mean, and variance by factor group
sum_table <- as.data.frame(resid4) %>% summarize(Q1 = quantile(resid4, 0.25), MEDIAN = median(resid4), MEAN = mean(resid4), Q3 = quantile(resid4, 0.75), IQR = IQR(resid4), STDEV = sd(resid4)) %>%
mutate(SKEW = ifelse(MEAN > MEDIAN, "RIGHT", "LEFT"))
pandoc.table(sum_table)| Q1 | MEDIAN | MEAN | Q3 | IQR | STDEV | SKEW |
|---|---|---|---|---|---|---|
| -33.25 | -2.5 | -4.113 | 35.8 | 69.05 | 99.88 | LEFT |
The greedy model’s mean residual is farther from zero than the PCA model and the standard deviation is larger for the greedy model at 99.88 bikes.
The residuals are plotted side by side to compare them visually.
box_pca <- ggplot(as.data.frame(resid3),aes(x = 1, y = resid3)) + geom_boxplot() + ylim(c(-500,500))
den_pca <- ggplot(as.data.frame(resid3),aes(x = resid3)) + geom_density() + coord_flip() + xlim(c(-500,500))
box_gr <-ggplot(as.data.frame(resid4),aes(x = 1, y = resid4)) + geom_boxplot() + ylim(c(-500,500))
den_gr <-ggplot(as.data.frame(resid4),aes(x = resid4)) + geom_density() + coord_flip() + xlim(c(-500,500))
grid.arrange(box_pca,den_pca, box_gr, den_gr, ncol = 4)The plots show a smaller variance in residuals for the PCA model which demonstrates its lower error.
Since the PCA model has a better distribution of residuals and smaller residuals on average, it is selected as the final model. It achieves a within-sample RMSE of 98.1 bikes and an out-of-sample RMSE of 96.9 bikes. This error rate is smaller than the naive linear model error rate of 147.8 bikes, and reduction in prediction error of about 35%. One issue that remains with this model is its overestimation of counts at high counts. Specifically, anyone using this model would have to be more skeptical when bike rental count predictions are above 500 bikes as the higher the prediction, the more likely the prediction will overestimate the true count.
While the PCA model is indeed an improved model over a naive linear model, other modeling techniques such as random forests and boosting could likely achieve better prediction accuracy.